Data Provenance: A Categorization of Existing Approaches

نویسندگان

  • Boris Glavic
  • Klaus R. Dittrich
چکیده

In many application areas like e-science and data-warehousing detailed information about the origin of data is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item includes information about the processes and source data items that lead to its creation and current representation. The diversity of data representation models and application domains has lead to a number of more or less formal definitions of provenance. Most of them are limited to a special application domain, data representation model or data processing facility. Not surprisingly, the associated implementations are also restricted to some application domain and depend on a special data model. In this paper we give a survey of data provenance models and prototypes, present a general categorization scheme for provenance models and use this categorization scheme to study the properties of the existing approaches. This categorization enables us to distinguish between different kinds of provenance information and could lead to a better understanding of provenance in general. Besides the categorization of provenance types, it is important to include the storage, transformation and query requirements for the different kinds of provenance information and application domains in our considerations. The analysis of existing approaches will assist us in revealing open research problems in the area of data provenance.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Provenance-Integration Framework for Distributed Workflows in Grid Environments

Provenance information about complex and distributed workflows is a key issue for data quality control and data reliability maintenance in reservoir management. Distributed and integrated environments where different workflows consume and transform data require a comprehensive provenance view. In this scenario provenance collection and integration presents significant challenges. In this paper,...

متن کامل

Provenance in Manually Curated Databases

Many curated databases are constructed by scientists integrating various existing data sources. Most current approaches to provenance in databases are based on views and fail to take account of the added value of the work done by scientists in manually creating and modifying data. Capturing provenance in such an environment is a challenging problem, requiring changes in practice, changes to exi...

متن کامل

Exploring Scientific Workflow Provenance Using Hybrid Queries over Nested Data and Lineage Graphs

Existing approaches for representing the provenance of scientific workflow runs largely ignore computation models that work over structured data, including XML. Unlike models based on transformation semantics, these computation models often employ update semantics, in which only a portion of an incoming XML stream is modified by each workflow step. Applying conventional provenance approaches to...

متن کامل

A Hierarchical Framework for Provenance Based on Fragmentation and Uncertainty

In the recent past, provenance – a research field that can be used to determine the origin and derivation history of data – has gained much attention [12]. Additionally, provenance is an important field for validating computation results. It covers coarse grained data as well as very fine grained information showing details of the implementation. Moreover, it is highly related to different topi...

متن کامل

Big Data Provenance: Challenges and Implications for Benchmarking

Data Provenance is information about the origin and creation process of data. Such information is useful for debugging data and transformations, auditing, evaluating the quality of and trust in data, modelling authenticity, and implementing access control for derived data. Provenance has been studied by the database, workflow, and distributed systems communities, but provenance for Big Data whi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007